Individuals store as much information these days as companies did a few
years ago, companies store as much as governments did, and governments store 
a good deal more than many people think they should. As much as we store, 
though, even more floods past every minute and disappears, and algorithms to 
sift through these massive data streams have become a hot topic in recent years. 
If we consider streaming algorithms to be those that work online and in
sublinear memory [5], then the well-known LZ77 [9] compression algorithm
becomes one when implemented with a sublinear sliding window. It follows from
Wyner and Ziv's analysis [8] that LZ77 with a superconstant window is still
universal with respect to finite-order Markov sources, and it is easy to see
universality is impossible with constant memory. Therefore, in some sense, for
thirty years we have had an optimal algorithm for compressing data streams.
Nevertheless, common sense says having the window grow too slowly must result
in poor compression and, indeed, it provably slows the compression ratio's
convergence to the entropy.

In a previous paper [2] we showed how LZ77 can be implemented with a 
slightly sublinear window and the same upper bound on convergence as plain 
LZ77, but that this is impossible with a sublogarithmic window. Plain LZ77 itself 
converges fairly slowly, however, so we eventually turned our attention to other 
algorithms. Grossi, Gupta and Vitter [3] recently proved a very strong upper 
bound for an algorithm based on the Burrows-Wheeler Transform:

Theorem 1. Let $s$ be a string of length $n$ over an alphabet of constant size
$\sigma$. Using $O(n)$ time we can always encode $s$ in
$nH_k(s) + O(\sigma^k \log n)$ bits for all integers $k \ge 0$ simultaneously.

The $k$th-order empirical entropy $H_k(s)$ of $s$ (see, e.g., [I]) is its
minimum self-information per character with respect to a $k$th-order Markov
source. Equivalently, it measures our expected uncertainty about a character
of $s$ given a context of length $k$, as in the following experiment: $i$ is
chosen uniformly at random between 1 and $n$; if $i \le k$, then we are told
the $i$th character of $s$; otherwise, we are told the $k$ preceding
characters and asked to guess the $i$th. Therefore, by the Noiseless Coding
Theorem [7], we need at least $nH_k(s)$ bits to encode $s$ with an algorithm
that uses only contexts of length at most $k$.
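
The definition above can be made concrete with a short computation. The following Python sketch is illustrative only: the function name `empirical_entropy` is hypothetical, and it assumes the common convention that the first $k$ characters, which have no full context, contribute nothing to the sum.

```python
from collections import Counter, defaultdict
from math import log2


def empirical_entropy(s: str, k: int) -> float:
    """Return H_k(s), the kth-order empirical entropy in bits per character."""
    n = len(s)
    if n == 0:
        return 0.0
    # Group every character by the k characters that precede it.
    # Assumed convention: the first k characters have no full context
    # and are skipped.
    followers = defaultdict(Counter)
    for i in range(k, n):
        followers[s[i - k:i]][s[i]] += 1
    bits = 0.0
    for counts in followers.values():
        total = sum(counts.values())
        # Zeroth-order self-information of the characters that follow
        # this context; each log term is <= 0, so we subtract.
        bits -= sum(c * log2(c / total) for c in counts.values())
    return bits / n
```

For instance, `empirical_entropy("abababab", 0)` is 1 bit per character, while `empirical_entropy("abababab", 1)` is 0, because each character is determined by the one before it.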

In this note we use two facts about empirical entropy: first, $nH_k(s)$ is
superadditive, i.e.,
$|s_1 s_2| H_k(s_1 s_2) \ge |s_1| H_k(s_1) + |s_2| H_k(s_2)$; second, if every
occurrence in $s$ of each $k$-tuple is followed by the same distinct character
(or the end of the string), then $H_k(s) = 0$. The first fact means if we
break $s$ into $b$ blocks and encode each block separately with Grossi, Gupta
and Vitter's algorithm, then the bound on the encoding's total length is
$nH_k(s) + O(b \sigma^k \log n)$ for all integers $k \ge 0$ simultaneously.
The second fact means a de Bruijn sequence of order $k$, or a string
consisting of any number of repetitions of all but the last $k - 1$ characters
of such a sequence, has $k$th-order empirical entropy 0. A $\sigma$-ary de
Bruijn sequence of order $k$ contains every possible $k$-tuple exactly once
and, thus, contains $\sigma^k + k - 1$ characters, of which the first and last
$k - 1$ are the same. Over sixty years ago de Bruijn [1] counted the number of
binary de Bruijn sequences of order $k$, and five years ago Rosenfeld [6]
generalized his result:

Theorem 2. There are $(\sigma!)^{\sigma^{k-1}}$ $\sigma$-ary de Bruijn
sequences of order $k$.
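
For very small parameters, both the count in Theorem 2 and the zero-entropy property of repeated de Bruijn sequences can be checked by exhaustive search. The Python sketch below is only a sanity check under stated assumptions: the helper names are hypothetical, the alphabet is taken to be `range(sigma)`, and a de Bruijn sequence is treated as a linear string of length $\sigma^k + k - 1$, as in the text.

```python
from itertools import product
from math import factorial


def is_de_bruijn(t: tuple, sigma: int, k: int) -> bool:
    """True if t has length sigma**k + k - 1 and contains every k-tuple once."""
    if len(t) != sigma ** k + k - 1:
        return False
    # There are exactly sigma**k windows; if they are all distinct,
    # every possible k-tuple over range(sigma) occurs exactly once.
    windows = {t[i:i + k] for i in range(sigma ** k)}
    return len(windows) == sigma ** k


def count_de_bruijn(sigma: int, k: int) -> int:
    """Count sigma-ary de Bruijn sequences of order k by brute force."""
    length = sigma ** k + k - 1
    return sum(is_de_bruijn(t, sigma, k)
               for t in product(range(sigma), repeat=length))


def successors_are_unique(s: tuple, k: int) -> bool:
    """True if each k-tuple in s is always followed by the same character,
    which is the condition in the text under which H_k(s) = 0."""
    nxt = {}
    for i in range(k, len(s)):
        if nxt.setdefault(s[i - k:i], s[i]) != s[i]:
            return False
    return True
```

With these helpers, `count_de_bruijn(sigma, k)` can be compared against `factorial(sigma) ** (sigma ** (k - 1))` for pairs such as (2, 2), (2, 3) and (3, 2); larger parameters quickly become infeasible for brute force. Dropping the last $k - 1$ characters of a de Bruijn sequence `t` with `t[:sigma ** k]` and repeating the result gives a string for which `successors_are_unique` holds.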

With these two theorems we can easily prove a nearly tight trade-off between 
memory and redundancy: 

Theorem 3. Let $s$ be a string of length $n$ over an alphabet of constant size
$\sigma$ and let $c$ and $\epsilon$ be constants with $1 \ge c \ge 0$ and
$\epsilon > 0$. Using $O(n)$ time, $O(n^c)$ bits of memory and one pass we can
always encode $s$ in $nH_k(s) + O(\sigma^k n^{1-c+\epsilon})$ bits for all
integers $k \ge 0$ simultaneously. On the other hand, even with unlimited
time, using $O(n^c)$ bits of memory and one pass we cannot always encode $s$
in $O(nH_k(s) + \sigma^k n^{1-c-\epsilon})$ bits for, e.g.,
$k = \lceil (c + \epsilon/2) \log_\sigma n \rceil$.

Proof. We first prove the upper bound. Suppose we are given $n$ in advance.
Let $A$ denote Grossi, Gupta and Vitter's algorithm. Although $A$ itself is
not one-pass, we use it as a subroutine in a one-pass algorithm as follows: we
process $s$ in $b = O(n^{1-c+\epsilon/2})$ blocks $s_1, \ldots, s_b$, each of
length $O(n^{c-\epsilon/2})$; we read each block $s_i$ into memory in turn,
compute and output $A(s_i)$, and erase $s_i$ from memory. Since $A$ takes
linear time and, thus, memory at most proportional to the input size times the
word size, we compute $A(s_1), \ldots, A(s_b)$ using $O(n)$ time, $O(n^c)$
bits of memory and one pass. As we noted above, by superadditivity the whole
encoding is at most
$nH_k(s) + O(b \sigma^k \log n) = nH_k(s) + O(\sigma^k n^{1-c+\epsilon})$ bits
for all integers $k \ge 0$ simultaneously, since
$\log n = O(n^{\epsilon/2})$. Now suppose we are not given $n$ in advance. We
work as before but we start with a constant estimate of $n$ and, each time we
have read that many characters of $s$, double it. This way, we increase the
number of blocks by an $O(\log n)$-factor and the size of the largest block by
an $O(1)$-factor, so our asymptotic bound on the whole encoding's length does
not change.
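
The block structure of this one-pass scheme, including the doubling estimate of $n$, can be sketched in Python as follows. This is an illustration under loud assumptions, not the algorithm analysed above: `zlib` is only a stand-in for the block encoder $A$ and does not achieve the bound of Theorem 1, the function names and the initial estimate of 16 are arbitrary, and a real encoder would also need to frame the blocks in its output stream.

```python
import zlib
from typing import Iterable, Iterator


def encode_stream(chars: Iterable[int], c: float, eps: float) -> Iterator[bytes]:
    """Yield independently compressed blocks in a single pass.

    `chars` is any iterable of byte values (0..255). zlib stands in for
    the block encoder A; this is an assumption made for illustration.
    """
    estimate = 16          # constant initial guess for n (arbitrary choice)
    seen = 0               # number of characters read so far
    block = bytearray()
    for ch in chars:
        block.append(ch)
        seen += 1
        # Blocks have length about estimate**(c - eps/2), and at least 1.
        limit = max(1, int(estimate ** (c - eps / 2)))
        if len(block) >= limit:
            yield zlib.compress(bytes(block))
            block.clear()  # erase the block from memory
        if seen >= estimate:
            estimate *= 2  # we have read `estimate` characters: double it
    if block:
        yield zlib.compress(bytes(block))


def decode_stream(blocks: Iterable[bytes]) -> bytes:
    """Invert encode_stream by decompressing the blocks in order."""
    return b"".join(zlib.decompress(b) for b in blocks)
```

Only the current block is held in memory at any time, which is the property the proof relies on; the compression quality of each block depends entirely on the encoder substituted for $A$.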

We now prove the lower bound. Suppose
$k = \lceil (c + \epsilon/2) \log_\sigma n \rceil$, $d$ consists of all but
the last $k - 1$ characters of a randomly chosen $\sigma$-ary de Bruijn
sequence of order $k$, and $s$ consists of repetitions of $d$. Then
$H_k(s) = 0$ and
$O(nH_k(s) + \sigma^k n^{1-c-\epsilon}) = O(n^{1-\epsilon/2})$, but $d$'s
expected Kolmogorov complexity is at least
$\log (\sigma!)^{\sigma^{k-1}} = \Omega(\sigma^k) = \Omega(n^{c+\epsilon/2})$,
i.e., at least linear in $d$'s length and asymptotically greater than the
memory we can use. Notice we can reconstruct $d$ from the memory
configurations when we start and finish reading a copy of $d$ in $s$ and the
bits we output while reading that copy (if there were another string $d'$ that
took us between those two memory configurations while outputting those bits,
then we could substitute $d'$ for that copy of $d$ without changing the
overall encoding). Therefore, we output $\Omega(\sigma^k)$ bits for each copy
of $d$ in $s$, or $\Omega(n)$ bits in total. □
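
The arithmetic behind the lower bound can be spelled out step by step. The following is a sketch rather than part of the original proof; it assumes, for simplicity, that $n$ is a multiple of $|d| = \sigma^k$ and that $\sigma \ge 2$, and it uses the `align*` environment from `amsmath`.

```latex
\begin{align*}
k = \lceil (c + \epsilon/2) \log_\sigma n \rceil
  &\;\Longrightarrow\; n^{c+\epsilon/2} \le \sigma^k < \sigma\, n^{c+\epsilon/2}, \\
\sigma^k n^{1-c-\epsilon}
  &= \Theta\bigl(n^{c+\epsilon/2} \cdot n^{1-c-\epsilon}\bigr)
   = \Theta\bigl(n^{1-\epsilon/2}\bigr), \\
\log (\sigma!)^{\sigma^{k-1}}
  &= \sigma^{k-1} \log \sigma! = \Omega(\sigma^k), \\
\Omega(\sigma^k) - O(n^c)
  &= \Omega(\sigma^k)
  \quad \text{(bits output per copy of $d$, since $n^c = o(\sigma^k)$)}, \\
\frac{n}{\sigma^k} \cdot \Omega(\sigma^k)
  &= \Omega(n) = \omega\bigl(n^{1-\epsilon/2}\bigr).
\end{align*}
```

The last line shows that the total output cannot be $O(n^{1-\epsilon/2})$, which is what the bound $O(nH_k(s) + \sigma^k n^{1-c-\epsilon})$ would require for this choice of $s$ and $k$.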